LDA Topic Modeling for Reviews

Andy Karr

0. Introduction

This report investigates 12,415 reviews from 2017-2021 and segments them into topics using LDA (Latent Dirichlet Allocation) topic modeling.

The purpose of this investigation is to find between 3 and 7 topics (the best number of topics is investigated statistically), to highlight what those topics are, and to segment new data into these topics.

Coherence score analysis found that 7 topics gives the highest coherence score, so 7 topics is the optimal choice.

Some stop words were deliberately retained during text cleaning so that terms such as "not good" survive: words like "not" are normally removed as stop words, but removing them here would lose the negative sentiment of "not good".

The report is split into the following sections.

  1. Preparation of the data.
  2. Building the model and exploring the created topics.
  3. Visualisation of the model.
  4. Using new reviews and assigning them to our created topics. Two mock reviews are used as a demonstration, and instructions for assigning topics to new reviews are given in section 4.

1. Data Prep

1.1 Load libraries

1.2 Glance at the raw data

1.3 Frequency of reviews for each year

1.4 Data prep

The data is prepared by

  1. Taking out stopwords.
  2. Tokenisation (splitting the reviews into individual words).
  3. Taking bigrams and trigrams.

A note on stop words: negative words such as "not" are normally removed, but for this analysis I believe they are important to keep. For example, one review says "not good". Standard stop-word removal would strip out "not" and keep only "good", so the message would be interpreted as positive when it is actually negative. For that reason I used a manual stop-word list and generally kept words like "not" and "couldn't".

Equally, I included bigrams and trigrams. By default this kind of modeling uses single words: if a review said "the bike is not good", each word "the", "bike", "is", "not", "good" would enter the analysis individually (in practice "the" and "is" are stop words and are removed). With bigrams, two consecutive words are grouped into a single term, so "not good" is treated as one token in the analysis, which captures the negative sentiment. Trigrams work the same way with three words, for example "not working well" is grouped together and used in the model.
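The preparation steps above can be sketched in plain Python. The stop-word list and the helper names (`tokenise`, `ngrams`, `prepare`) are illustrative assumptions, not the code used in the report:

```python
import re

# Illustrative stop-word list: negations such as "not" are deliberately kept.
STOPWORDS = {"the", "is", "a", "an", "and", "it", "this", "that"}

def tokenise(review):
    """Lowercase, split into words, and drop stop words."""
    words = re.findall(r"[a-z']+", review.lower())
    return [w for w in words if w not in STOPWORDS]

def ngrams(tokens, n):
    """Join each run of n consecutive tokens into one underscore-joined term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prepare(review):
    """Unigrams plus bigrams and trigrams, as described above."""
    tokens = tokenise(review)
    return tokens + ngrams(tokens, 2) + ngrams(tokens, 3)

print(prepare("the bike is not good"))
# → ['bike', 'not', 'good', 'bike_not', 'not_good', 'bike_not_good']
```

Note how "not_good" and "bike_not_good" survive as single terms, so the negative sentiment reaches the model.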

2. LDA Analysis

2.1 Run the model and look at the words that contribute most to each topic

The output below shows, for each topic, the words that contribute most to it: for example, "not" contributes most to topic 0, "break" contributes most to topic 1, and so on. This is presented more clearly later in the report.

2.2a Coherence score check

The following code chunk gives the coherence score for 7 topics: 0.39. This tells us how good a choice 7 topics is. The score is relatively low, which can be explained by the fact that many reviews contain very few words, sometimes just a single word such as "top". With more words per review this score would increase, so given this, a coherence of 0.39 is acceptable.

2.2b Check what is the best number of topics

The following code computes the coherence score of an LDA model for each number of topics in the range 3 to 7, to find the optimal topic count; a higher coherence score indicates a better number of topics. The code is commented out in this final report because it takes a long time to run.

However, I did run this code previously and found 7 was the best number of topics. The pre-processing was less refined at that time, so the scores were lower than the 0.39 reported above for 7 topics, but 7 topics would still come out as the winner with the new pre-processing applied.

The image of that plot follows: 7 topics has the highest coherence score, and therefore we use 7 topics.

2.3 Create a table so we can see which topic is assigned to each document

The following table shows the topic assigned to every document. Only the first 10 rows are shown for brevity; to see the whole table, remove "head(10)" from the end of the following code chunk.

2.4 Table that shows the words that contribute the most to each topic, with the review that contributes the most to that topic

The following table shows, for each topic ("Topic_Num"), the most important words for that topic ("Keywords") and the single review that contributes most to it (the review is in "Representative Text", and its contribution in "Topic_Perc_Contribution").

3. Visualisation

3.1 Wordclouds

This chunk produces a word cloud for each topic, in which more important words appear larger. The images are saved to your working directory.

3.2 t-SNE plot

This shows the topics in 2 dimensions so we can see the similarity between them. Hover your mouse over a point to see some example reviews for it.
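The projection behind such a plot can be sketched with scikit-learn's t-SNE; the random Dirichlet data below is a stand-in for the real per-document topic distributions, and `perplexity=10` is an illustrative setting:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: each row is one document's distribution over the 7 topics.
rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(np.ones(7), size=50)

# Project the 7-dimensional topic space down to 2-D for plotting.
xy = TSNE(n_components=2, perplexity=10,
          random_state=0).fit_transform(doc_topics)
print(xy.shape)  # one 2-D point per document
```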

3.3 pyLDAvis plot

This plot again shows the topics in 2 dimensions. The distance between topics is a measure of their similarity: topics that are further apart are less similar. Hover your mouse over a topic to see its most important words. You can also use the "Next Topic" button to step through the topics, since some are close together and difficult to pinpoint with the mouse.

It is important to note that the topic numbers here do not correspond to the topic numbers in the rest of the report; this is a weakness of pyLDAvis. You can decode the topics as follows:

pyLDAvis Topic = Original topic number (from above)
Topic 1 = Topic 3
Topic 2 = Topic 0
Topic 3 = Topic 2
Topic 4 = Topic 1
Topic 5 = Topic 4
Topic 6 = Topic 5
Topic 7 = Topic 6

3.4 Plot of changing topic proportions over time.

This plot shows how the volume of each topic changes over time.
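The underlying aggregation can be sketched with pandas; the toy table below (years and dominant topics per review) is made-up illustrative data:

```python
import pandas as pd

# Toy table: one row per review, with its year and dominant topic.
df = pd.DataFrame({
    "year":  [2017, 2017, 2018, 2018, 2018, 2019],
    "topic": [0, 1, 0, 0, 1, 1],
})

# Share of each topic within each year.
props = (df.groupby("year")["topic"]
           .value_counts(normalize=True)
           .unstack(fill_value=0))
print(props)
# props.plot(kind="line") would then draw the proportions over time
```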

4. Classifying new reviews into the topics.

This next section feeds new reviews into the model we created and assigns each of them one of the 7 topics. I used two mock reviews that I created to show which topics they are assigned to.

You can import new data, call it new_df, give the reviews the column name "feedback_message", and substitute that import into the following chunk. The next chunk (i.e. cell) creates mock data; it can be replaced with real data.

4.1 The following table assigns each new review to a topic

Looking at the "Dominant_Topic" column, the table shows that the first mock review is assigned to topic 0 and the second to topic 2.